Instant Replay: Investigating statistical analysis in sports. 
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Technology has had an unquestionable impact on the way people watch sports. Along with this technological 
i— h evolution has come a higher standard to ensure a good viewing experience for the casual sports fan. It can 

! ^be argued that the pervasion of statistical analysis in sports serves to satiate the fan's desire for detailed sports 

Ph statistics. The goal of statistical analysis in sports is a simple one: to eliminate subjective analysis. In this 
^ paper, we review previous work that attempts to analyze various aspects in sports by using ideas from Markov 
^ Chains, Bayesian Inference and Markov Chain Monte Carlo (MCMC) methods. The unifying goal of these 
works is to achieve an accurate representation of the player's ability, the sport, or the environmental effects on the 
l—— 'player's performance. With the prevalence of cheap computation, it is possible that using techniques in Artificial 
t^- Intelligence could improve the result of statistical analysis in sport. This is best illustrated when evaluating 
^ football using Neuro Dynamic Programming, a Control Theory paradigm heavily based on theory in Stochastic 
processes. The results from this method suggest that statistical analysis in sports may benefit from using ideas 
^ from the area of Control Theory or Machine Learning. 
^ Keywords: dynamic programming, markov, baseball, basketball, football 
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1 Introduction 

Over time, statistics and sports have become syn- 
onymous with each other. This synonymity between 
sport and statistics can be best exemplified when ob- 
serving the winner and loser of a game. A team only 
wins by outscoring its opponent. However, in order 
to outscore an opponent, additional contributions from 
non-scoring players on the team are required. As a re- 
sult, to reflect each player's contribution to the team sta- 
tistical categories have been added. Over time, these 
player contributions have been further refined to best 
reflect overall performance. 

An example of this is in basketball when a player 
retrieves the ball after a missed shot. Such a retrieval 
is called a rebound. This statistic has been further re- 
fined in to two categories: defensive rebound and offen- 
sive rebound. A defensive rebound is when the oppos- 
ing team's player shoots and misses and the defending 
team's player successfully retrieves the ball. An offen- 
sive rebound is when a player on the team misses the 
shot, but a player on the same team manages to retrieve 
the ball. This refinement allows observers to assess a 
player's rebounding abilities when playing defense or 
offense. 

This refinement provided by statistical analysis seeks 
to accurately and impartially evaluate player and/or team 
talent. What can make this analysis difficult is the team- 
oriented nature of sports, because player performance 
can be heavily impacted by the team's performance. 
Various authors have attempted to use different tech- 
niques to assess a player's ability, independent from 
their team, with mixed results. 

It is also difficult to assess the performance of a 
team in a hypothesized situation. Such an example would 
be investigating the impact of a certain team's batting 
lineup affecting the outcome (win or loss) of the game. 
Could there have been a different lineup ordering such 
that the team could have won the game? As we will see, 
there has been some work done in attempt to answer this 



question. 

The human element, in conjunction with the statis- 
tical nature of sports, provides a challenge for any re- 
searcher who seeks to investigate consequences of cer- 
tain phenomena. The statistical analysis done in sports 
indirectly tries to assess players' Decision Making when 
facing uncertainty. Many areas of science are interested 
in decision making when facing uncertainty, and good 
results will likely benefit more than just one discipline 
of science. 

The goal of this review is to provide researchers un- 
familiar with statistical analysis in sports with an under- 
standing of what important research has been conducted 
in the area over the last four decades. It is assumed the 
reader is familiar with the methods described in the ap- 
pendices. 

2 Previous Work 

There has been a wide array of statistical analysis 
performed on sports, primarily baseball. In this sec- 
tion, we discuss the important results in the past four 
decades. These results provide an idea of how statisti- 
cal analysis can be used to evaluate desired properties 
of either players, teams or the game itself. 

2.1 Offensive Earned Run Average for Base- 
ball 

In the introduction, it was mentioned that a player's 
performance can be significantly impacted by their team. 
In baseball, this is especially true with statistics such as 
Runs Batted In (RBI), which is credited to a batter when 
the outcome of their at-bat scores a run. This means that 
if the outcome was a fly out and a run is scored, the bat- 
ter is credited with a RBI even if the base runner had the 
speed to out-run the throw to home plate. This shows 
that there exist player statistics in baseball that depend 
on the team's overall abilities. 
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Offensive Earned Run Average (OERA) was cre- 
ated by Cover and Keilers (1977) to accurately assess a 
player's offensive output to avoid situations such as the 
above. That is, they wanted to create a metric that was 
independent from the team's performance. The central 
idea behind OERA is personal innings, which are de- 
fined as innings where the player bats at every position 
in the lineup. The provided example was one where 
a batter who starts his career with the at-bat sequence 
"single, out, double, out , walk, walk, homerun, out" 
generating five runs in this personal inning. The claim 
was that OERA is a measure of "batter effectiveness", 
with units of measurement being the expected number 
of runs scored per game. 

The idea of personal innings is effective when the 
goal is to measure a player's individual offensive contri- 
bution to the team. A player who possesses a high bat- 
ting average, on base percentage and slugging percent- 
age will generate more runs than a player with the same 
on base percentage, slugging percentage but a lower 
batting average. By measuring batting effectiveness in 
such a manner, a player's OERA depends on their abil- 
ity to reach base and advance/score baserunners. 

The batter's cumulative statistics are used to com- 
pute the probability of achieving any of the six hitting 
outcomes. The six outcomes are: strike out, walk, sin- 
gle, double,triple and home run, and their probabilities 
are denoted as Po,Pb,Pi,P2,P3,P4, respectively. The 
expected number of runs is generated by using these 
probabilities, averaged over all the possible sequences 
of hitting performances. 

The model to represent the game state consists of 
8 x 3 = 24 states. This is because when there are zero, 
one, or two, there are exactly 2 3 different positions that 
the base-runners can take. Using a three digit binary 
sequence, where the digit position denotes a base runner 
at that base, the eight states for each out become clear 
(000,00 1 ,0 1 0, 1 00, 1 1 ,0 1 1 , 1 1 0, 1 1 1 ). The third out state 
is ignored since it signifies the end of the inning. 

Let E(s) represent the expected number of runs score 
in an inning when starting in state s. Since there is a 



Markovian Recurrence for E(s), it can be written as: 

E(s) = J^PHWiH, s)) + R(H, s)) = 

H (1) 
^2p(s'\s)E(s') + R(s) 

s> 

where H E {0,5,1,2,3,4} represent the hitting out- 
comes, and s € {0, . . . , 24} are the possible states. The 
resulting state s' from a hit is determined by function 
s' = f(H,s). The number of runs scored by the hit is 
denoted by R(H, s). A state-transition function is de- 
fined aSp(s'|s) = Y.H:f(H,s)=s'VH- 

This expectation also has an equivalent representa- 
tion using the theory of Absorbing Markov Chains: 

E = (/ - Q)^R (2) 

Where Q is the 24 x 24 representing the 24 non absorb- 
ing states previously mentioned, R and E are 24 x 1 
vectors representing the expected number of runs and 
runs in those 24 states. As an example, the first en- 
try in E, E(l), represents the expected number of runs 
earned with no outs and no men on base. Using the 
probabilities (po,Pb,Pi,P2,P3,P4), the negative bino- 
mial distribution can be used to calculate the probabil- 
ities required to find the expected number of batters in 
an inning. Specifically: 

F(N = i) = ~ 1 W- 3 (3) 

where q is defined as the on base percentage (OBP), 
expressed as q = 1 — p . 1 Therefore, taking the expec- 
tation of the above expression gives E(iV) = 3/po, the 
expected number of batters in an inning. Note that the 
OBR q , for a given type of hitter will differ. To illus- 
trate the meaning of "type of hitter", assume we clas- 
sify batters as either singles, doubles, triples or home 

'Note that this is the case because every other hitting outcome 
consists of the batter getting on base, and so subtracting po from 1 
gives us the probability the player will get on base. 
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run hitters. Then the OBP for a home run hitter would 
be g = 1 — Po, where p\ = p 2 = P3 = since this 
hitter only hits home runs. 

The number of runs generated by a home run hitter 
would be i? 4 = (3/po) — 3 = 3q /p , which is the ex- 
pected number of batters in an inning minus the number 
of outs (three). The idea is similar for the other types 
of hitters. Note that Ri denotes the expected number of 
runs earned for a hit of type i, where i G B, 1, 2, 3, 4. 

Using the above information, the general case of 
"pure-hitters" can be defined. Assume that N is the ran- 
dom variable for the number of batters in an inning. For 
each type of hitter, there exists a minimum number of 
runners required prior to scoring a run; as a result Cov- 
ers and Keilers (1977) define, for any real number t, 
(N - t) + = N - t if iV - t > and (N - t)+ = 
otherwise. Since will be at least three players in each 
inning, t > 3 because at least three batters are required 
to end an inning, assuming each of their at-bat outcomes 
resulted in an out. 

This means a home run hitter's expected number of 
runs is R A = E(N — 3) + because they have three outs, 
but do not require any men on base to score. Similarly 
for a singles hitter Ri = E(N — 5) + because there must 
be two men on base (one on first base and one on second 
base due to the assumptions) in addition to the three 
outs before he can score. The doubles and triples hitter 
have the same number of players on base required, one, 
which gives us R 2 = R3 = E(N - 4) + . For the all 
walks hitter, they need three men on base in addition to 
three outs and therefore Rb = E(N — 6) + . Using this 
information, Covers and Keilers (1977) give the final 
expressions: 

Rb = 3/po - 6 + 3p 3 (l + 2q Q + 2q 2 ) 
Ri = 3/po - 5 + 3p 3 Q q + 2p 3 
R 2 = R 3 = 3/p -4 + pl 
R 4 = 3/po - 3 

Where the player's OERA is the sum of these ex- 
pressions. 



The OERA calculation is airtight in the sense that 
it uses the most independent player statistic to calcu- 
late this production, which is the batting average. The 
drawback is how OERA uses this statistic. It does not 
factor in the league-wide average in order to get a true 
measure of a player's batting effectiveness. We give the 
following scenario to illustrate this problem. 

Consider a batter who's career batting average was 
0.270 with an OBP of .330, where the league- wide bat- 
ting average was 0.285 with an OBP of 0.350 during his 
career. Now consider a player from the "deadball era" 
with a career batting average of .230 and OBP of 0.3, 
where the league wide batting average was 0.21 with an 
OBP of 0.25. The second player was clearly better with 
respect to the league performance, but his OERA will 
be much lower than the first player. However, is it fair 
to say that the second player wasn't an effective batter? 
It can be argued that the second player's batting effec- 
tiveness was superior to the first player's given that the 
second player was above league average and the first 
player was not. 

The above scenario outlines a rather large caveat 
in the OERA calculations. Historically, baseball has 
shown shown fluctuations with respect to average num- 
ber of runs scored in a game, and this affects OERA 
since it doesn't give us a relative measure of players 
with respect to their peers. 

2.2 Composite Batter Index (CBI) 

Like OERA and the Scoring Index, CBI attempts to 
quantify a player's offensive abilities. A distinguish- 
ing and interesting property of CBI when compared to 
OERA is that CBI is a relative measure, meaning it at- 
tempts to gauge a player's production with respect to 
the entire league. It was developed by Anderson et al 
(1997) using Data Envelopment Analysis. Since CBI 
is a relative measure, it allows the metric to be insen- 
sitive to league-wide changes such as poorer pitching, 
rule changes, park changes, or increases/decreases in 
league averages. 
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The CBI model has one input with five outputs. This 
single input is plate appearances, which contains the of- 
ficial number of at-bats plus the number of walks. Sac- 
rifice flies/bunts and being hit by a pitch are ignored in 
the calculations. The output, Y, consists of the number 
of walks, singles, doubles, triples and home runs. 

Because CBI relies on a technique called Data En- 
velopment Analysis, a standard linear programming for- 
mulation is used: 

minimize 6 , 
subject to FA > Y , 

ex > x'x (5) 

6free, A > 

where 6 is a score that measures the productivity 
of the player relative to the rest of the league, and is in 
the range of and 1.0. This means if a player has a 
6 = 0.8, then some hitter (or combination of hitters) 
could have produced at least the same amount of each 
type of hit in 20% fewer plate appearances. A value 
of = 1.0 implies that the player is a league leader 
because they can't be surpassed by any combination of 
players in equal or less plate appearances. A is a vector 
of virtual multipliers that describe the combination of 
league leaders that are equal or greater than the player 
studied. 

Anderson et al's (1997) initial results showed that 
players were able to achieve league-leader status only 
on the basis of being able to obtain singles or walks. Af- 
ter further examination they concluded that these eval- 
uations were unreasonable because there were certain 
players who hit enough longer hits to compensate for 
the deficit in shorter hits, and consequently should sur- 
pass these short hitting league leaders. 

They argue that a player who has a larger number of 
longer hits can elect to stop on first base during these 
hits, instead of continuing to second or third base. By 
this argument, Anderson et al (1997) conclude that there 
is a dominance relationship among the types of hit. This 
dominance would allow single base hits to be the sum 



of singles, doubles, triples, and home runs, double base 
hits to be the sum of doubles, triples, and home runs, 
triple base hits to be the sum of triples and home runs, 
and walks to be the sum of walks, singles, doubles, 
triples and home runs. This "dominance transforma- 
tion" was performed as a pre-processing step. All anal- 
yses for the remaining part of their paper involved using 
this transformation. 

The outcome of this analysis indicated there were 
some (statistically unproven) trends in the historical data 
that was used. Anderson et al (1997) state that there is 
a trend towards a higher league-wide CBI which im- 
plies that batting skill is becoming more uniformly dis- 
tributed. With the recent divulging of the "steroid era" 
in baseball, this seems to be a reasonable assumption 
as players are continually trying to improve their per- 
formance. Also mentioned was the increase of league 
leaders in each year when looking at the raw numbers: 
there were six occurrences when CBI league leaders in- 
cluded ten players or more, but only one of these was 
before 1975. This lead Anderson et al (1997) to con- 
clude that it is more difficult to dominate a league than 
it was in the past. Lastly, they mention that the pro- 
portion of players with high CBI scores has increased 
and the number of players with low CBI scores has de- 
creased. 

Treating a player's offensive performance as a linear 
program is an intriguing concept, however the method- 
ology itself must be questioned. While the idea of a 
relative measure is most definitely an enticing one, is 
an optimization technique the best way to derive such 
a measure? Constraint programming's versatility can- 
not be argued, but the application to a sport in order to 
measure a player's offensive performance relative to the 
league is difficult. 

As we have seen already, with the use of statistics, it 
is possible create effective player metrics that measure a 
certain aspect of baseball. CBI and OERA seek to mea- 
sure offensive performance in different ways, conse- 
quently leading to very different models. Each of these 
models sought to evaluate individual performance, but 
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what about team performance? OERA's model con- 
sisted of 24 states, where the expected number of runs 
from each state could be calculated with a Markovian 
Recurrence. It seems logical that with a few modifica- 
tions that OERA's model can be used for each player on 
a team, allowing team performance to be measured. 

2.3 Markov Chain Approach to Baseball 
2.3.1 Method 

The Markov chain method proposed by Bukiet et al. 
(1997) sought to evaluate the the baseball team perfor- 
mance and the effect of a player on team performance. 
They noticed that "run-production" models like OERA 
did not lend insight to team performance. 

Of specific interest to them was the the influence 
of batting order on a team's performance. They also 
use this method to approximate the expected number 
of wins in a season, the number of runs of scored in 
an inning, and the influence of trading a player on the 
team wins. If a team wins a game, they must outscore 
their opponent. The best batting order is the one that 
produces the most runs. It is easy to see then that this 
model relies heavily on run production to evaluate these 
different aspects. This can be a potential issue, since 
there exist teams (such as the 2010 San Francisco Gi- 
ants) which rely on defense and pitching, instead of hit- 
ting, to win games. 

It was mentioned that with slight modifications, the 
OERA model could be used to measure team perfor- 
mance. Since team performance is being measured, the 
three-out state serves as an absorbing state denoting the 
end of a team's inning. Consequently, each player pos- 
seses a 25 x 25 transition matrix. Each entry in this 
matrix contains the probability for this player, in a sin- 
gle at bat, to change the game state to any other state. 
Because of low data availability, simple statistical mod- 
els were used with the available data for each type of hit 
to fill this matrix. 

Generating data for probabilities of each hitting out- 



come for a player can be hazardous. In contrast, OERA 
did not have this issue since only six hitting outcomes 
were required, and the baserunners' position did not af- 
fect a player's hitting probabilities. In this case, the six 
hitting probabilities for each player must be generated 
for different baserunner positions. Generating data in 
such a manner may cause a player's 25 x 25 transition 
matrix to be unrepresentative of their real- world perfor- 
mance. 

An advantage of this model is that the situational 
data can be implemented into the transition matrix with 
ease. Consider a batter who is twice as likely to hit a 
home run with the bases loaded. By multiplying all en- 
tries in the transition matrix that correspond to a transi- 
tion to a state with no runners on base and the same 
number of outs, the probability of a player hitting a 
home run in bases loaded has now been doubled. 

The 25 x 25 (block) transition matrix for every player 
has the following form: 



(6) 



Where all A, B, and C matrices are 8 x 8 block ma- 
trices, and D , Ei, F 2 are 8 x lcolumn vectors. The 8 x 
8 block matrices represent the eight possible states of 
base runners; more specifically A represents the events 
which do not increase the number of outs in the inning, 
B represents the events for which the outs increase by 
one but do not end in three outs, and the C matrix rep- 
resents the events that result in a double play (from no 
outs to two outs). The rows of each of these matrices 
represent the transition probabilities from these states. 
Therefore the 2th row and jth row of the matrices is the 
transition from the zth state to the jth state. 

From this information, we know that there are cer- 
tain transitions where one can infer that a run is scored. 
Consider the transition from the state with zero outs and 
base runners on first and second base to a state with zero 
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outs and a base runner on second base; this can only oc- 
cur when two runners score. The runs scored from these 
transitions are used to calculate the run distribution for 
each team. 

To calculate the distribution of runs in the game, the 
probability of scoring any number of runs until the cur- 
rent at bat is calculated. For a single inning, this calcu- 
lation involves a 1 x 25 vector u whose first entry is 
one, and the remaining 24 entries are zero. This vec- 
tor represents with state with no outs, and no one on 
base. Similarly, u n represents the situation where n bat- 
ters have already had their turn batting. Since n batters 
have already gone, then it is the turn of the n + 1 bat- 
ter. Therefore U n x P n+ i, where P n+ i is the transition 
matrix of the batter whose turn it is, a probability dis- 
tribution of the states in the inning after n + 1 batters. 
In order to keep track of the number of runs scored un- 
til this point in the inning, a 21 x 25 matrix Uo which 
has 21 rows to represent zero to 20 runs (the first row is 
zero runs, the last row is 20 runs), and 25 columns rep- 
resenting the current state of the inning, is maintained. 

When there is a transition that causes a run to score, 
the probability of this outcome is propagated to U n+ i. 
This relatively simple representation allows the distri- 
bution of runs to be calculated after any number of bat- 
ters as follows: 



U PiP 2 P 3 . . .P9P1P2 .. 



(7) 



where each subscript of P represents one of nine play- 
ers. Because this is for a single inning, U n is given nine 
times as many rows, where each row set of 21 rows rep- 
resents the number of runs in an inning. When an inning 
ends, the results are propagated to the next 21 rows and 
24 columns (the twenty fifth column is absorbing, and 
no scoring is done there). This computation is contin- 
ued until the probability that 27 outs have occurred is 
greater than 0.999. 

It is apparent that this model relies heavily on gen- 
erated data and an iterative stopping condition. When 
these two ideas are combined, the result may be com- 



pletely unrepresentative of the players and teams be- 
ing evaluated. Using sample data to generate proba- 
bilities of hitting outcomes in specific situations alone 
can cause results to be unrepresentative; using this data 
in an iterative computation that determines the stopping 
criteria will yield results that are unrepresentative of the 
team and player. 

2.3.2 Scoring Index 

One of the primary goals in developing the Markov 
Chain approach was to compute the near-optimal bat- 
ting order for a baseball team. With the Markov frame- 
work established as above, they also needed a way to 
rank a player's offensive ability. Instead of using OERA, 
they used a similar metric proposed by D'Esopo and 
Lefkowitz called the Scoring Index because of the simi- 
larity in run production calculation between the Markov 
Approach and Scoring Index. 

The Scoring Index is similar to OERA in sense that 
it uses nine copies of the batter in the lineup to calcu- 
late the offensive production, however there is ranking 
among these nine copies using a deterministic model of 
runner advancement. Also similar to OERA, the Scor- 
ing Index contains similar assumptions to ensure the 
computation is consistent, deterministic and represen- 
tative of the player's offensive capabilities. The distin- 
guishing factor is that it uses one inning of the Markov 
approach outlined above. And the transition matrix P 
for the scoring index is also 25 x 25, defined as: 



(8) 



Inspecting this P, it is apparent that it is different from 
the P given in the Markov Chain section. This is due to 
block matrices C , -Do an d E becoming zero because 
the Scoring Index ignores double and triple plays. The 
off-diagonal entries of block matrix B are zero because 
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runners cannot advance on an out. Just as OERA has 
some assumptions that may affect the calculation, the 
Scoring Index does as well. The claim was that these 
inaccuracies should "somewhat offset each other". 

A and B are block matrices with the following struc- 
ture: 
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(9) 

B = P out I (10) 

where the probabilities of getting a walk, single, double, 
triple, home run or out are denoted as P w , P s ,Pd,Pt, Ph 
and P out , respectively. I is an 8 x 8 identity matrix. F is 
a 8 x 1 vector with the entries: F = (Pout, ■ ■ ■ , Pout) 7 

We can use the theory of Absorbing Markov Chains 
to calculate the time to the three out state, which is the 
absorption state. The (i,j)th entry of matrix (J — Q)~ x 
gives the expected number of visits from state i to state 
j prior to being absorbed into the three out state. This 
is referred to as the "expected absorption time". The 
expected absorption time from state i can be calculated 
by: 

24 

E l (T) = ^(/-Q)r. 1 (11) 

and for the Scoring Index model: 

/ A B \ 
Q=\ A B J (12) 
\ OA/ 

which results in (/ — Q)^ 1 having the following struc- 
ture: 

/ R RBR RBRBR \ 
(I -Q)- 1 = R RBR (13) 
\ R J 



where R= (J - Ay 1 . 

2.3.3 Near-Optimal Batting Order 

Using the Scoring Index as a way to rank each player's 
offensive ability, the optimal batting order was calcu- 
lated with three different algorithms. These three algo- 
rithms focused on computational efficiency, since CPU 
time was not cheap at the time the article was written. 
As a result, the differences in these algorithms is omit- 
ted since most modem day computers can calculate the 
9! (362,880) permutations relatively quickly. It was ob- 
served that the lineup that produces the most runs was 
also the one with the most wins, which should follow 
naturally if the batting order is indeed optimal. 

By comparing the worst lineup to the team's near- 
optimal one, the results suggested the following criteria 
when constructing a team's batting lineup: 

(i) The batter with the highest scoring index should 
bat second, third or fourth. 

(ii) The batter with the second highest scoring index 
should bat between the first and fifth positions. 

(iii) The batters with the third and fourth best scoring 
indices should bat between the first and sixth po- 
sitions. 

(iv) The batter with the fifth highest scoring index should 
bat first, second or between the fifth and seventh 
positions. 

(v) The sixth best batter can bat in any position ex- 
cept eighth or ninth. 

(vi) The batter with the seventh highest scoring index 
can bat either first, between sixth through ninth 
positions. 

(vii) The batters with the lowest and second-lowest scor- 
ing indices should bat in the last three positions. 

(viii) Either the batter with the second or third highest 
scoring index should bat right after the best batter. 

(ix) The batter with the lowest scoring index should 
be four to six positions after the best batter. 

(x) The batter with the second lowest scoring index 
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should be four to seven batters after the best bat- 
ter. 

These ten possible criteria seem very reasonable, 
however there are no conclusive results showing that 
lineups following this criteria are optimal in compar- 
ison to the team's regular lineup, or whether this near- 
optimal lineup yields more runs (and consequently more 
wins) than the regular lineup. Further evaluation of this 
method is required; specifically there should be a com- 
parison between the near optimal lineup of the team ver- 
sus the regular lineup to determine whether this lineup 
will yield more wins. There should also be testing on 
more recent data, since the analysis was done on a sin- 
gle year of data (from 1989). 

2.3.4 Conclusion 

This article was the first to use an offensive metric (Scor- 
ing Index, similar to OERA) for nine players on the 
same team to evaluate the team performance. As men- 
tioned earlier, the methodology in this article may have 
affected the significance of the results. With the recent 
abundance of data provided by MLB's GameDay sys- 
tem, it should be possible to calculate each player's hit- 
ting probabilities in specific situations. Using such data 
may yield a much more promising result if this experi- 
ment is repeated. 

It is apparent that evaluating team performance is 
difficult if there isn't an accurate representation of each 
player. So far we have only considered the offensive 
output of a player while assuming their performance 
is not affected by the environment. However, statis- 
tics show that players performance is affected by spe- 
cific environments. When considering a player's per- 
formance, whether it is offensive or defensive, their per- 
formance in different conditions must be considered if 
the team wants to maximize the probability of winning 
each game. 



2.4 Breakdown Statistics 

Albert (1994) broke down players' annual statistics in 
order to test whether batter performance can be affected 
by certain situations. As an example: if a batter bats 
poorer when he's not playing on home field, is it be- 
cause of the playing surface or the fact that he isn't 
comfortable playing away from home field? In order 
to do this, eight common situations in which players' 
averages could be affected were evaluated: 

(i) Opposite side versus the same side. This means 
that when a pitcher is right-handed and throwing 
to a right-handed batter, this is the "same side". 
When a pitcher is left-handed throwing to a right- 
handed batter, it is the "opposite side". 

(ii) Whether a pitcher is a "groundball pitcher" or 
"flybail pitcher". This means that this pitcher re- 
lies on batters hitting ground balls (or fly balls) to 
the pitcher's defense in order to achieve an out. 

(iii) Home versus away. 

(iv) The playing surface (grass versus turf) 

(v) When a batter is "head in the count" versus two 
strikes in the count. Being ahead in the count 
means there are more balls than strikes (exception 
of three balls and two strikes, which is referred to 
as a "full count"). 

(vi) Runners in scoring position versus no runners in 
scoring position and no runners out. 

(vii) Performance in the first "half of the season ver- 
sus the second "half. The halfway point is deter- 
mined by the All Star game. 

It has been shown that there is a high correlation 
between player performance and whether they are play- 
ing at home or away from home. In order to evaluate 
the performance in each of these situations, the number 
of hits and the at bat for home and away games were 
recorded for 154 players. For the zth player, this can be 
represented as a 2 x 2 contingency table: 



■i2 Oi2 
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where hn, On, and aba denote the number of hits, outs, 
and at bats during home games; p a and p i2 denote the 
probabilities that the ith player gets a hit at home or 
away, respectively. It is assumed that the batting at- 
tempts are independent Bernoulli trials with the associ- 
ated probabilities of success, and that the number of hits 
hn and h i2 are independently distributed according to 
the binomial distributions with the parameters (abn , pn) 
and (abi2,Pi2) respectively. 

This information is transformed to approximate nor- 
mality by use of the logistic transformation: 



Vij = log 



hi 



(15) 



which allows yn and y^, to be independent normal with 
each yij having mean ^ = log(py/ (1 — pij)) and vari- 
ance of = (abijPij(l —pij)Y l ■ If the contingency table 



is now expressed as 2 x N we can represent every player. 
With this representation, we know that is the logit of 
player i's observed batting average in situation j. This 
allows the mean of y^ to be represented as : 



Hij = E(yij) = Hi + oii 



(16) 



where fii represents player i's hitting ability and oty rep- 
resents the situational effect that attempts to measure 
the change in this player's hitting ability due to situa- 
tion j. 

The ability parameters ^i, . . . , ji N are assigned in- 
dependently assigned flat noninformative priors as they 
are nuisance parameters. The reason the ability param- 
eters are considered as nuisance is because the goal is 
to measure the effect of the environment on the player's 
hitting ability, which is given by the situational effects . 
Since the situational effects are of interest, 

it is assumed a priori that a±, . . . , ccjv are independently 
distributed from some common population ir(a). 

The prior distribution used was a t distribution with 
mean \i a , scale parameter a a and a v = 4 degrees of 
freedom. In order to reflect the lack of knowledge about 
the size of the situational effect, p, a is assigned a non- 



informative prior. For a\, an informative prior is con- 
structed using the home/away variable as the the repre- 
sentative for the situational variables. The prior of a 2 a is 
based upon a posterior analysis of the home/away situa- 
tional variable in previous seasons. This prior distribu- 
tion is used in the posterior analysis for every situational 
variable. 

The results used a Gibbs sampler to simulate the 
posterior distributions. For a sample size of 1000, a pos- 
terior distribution was obtained with parameters ({/ij}, {aj},/i a 
These parameters were used to estimate the functions 
of the parameters of interest, primarily the a±, . . . , a^. 
The final data showed that the spread of each popula- 
tion for a given situation was roughly the same, allow- 
ing the differences between situations to be explained in 
terms of a shift. This is exemplified by the "groundball- 
flyball" effects, which are approximately ten batting av- 
erage points higher than the "day-night" effects. 

While nearly all of the situational effects being mea- 
sured did not reveal a significant difference in batting 
average, the effect of pitch count did. Specifically, when 
a batter was "behind" in the pitch count, they hit 123 
average points lower than when they were "ahead" in 
the count. A less significant result was that batters also 
hit 20 average points higher when facing a pitcher that 
was on the opposite side, 1 1 points higher when they 
were facing a groundball pitcher and eight points higher 
when at home. However, a caveat to these results is that 
these situational patterns are only apparent for a group 
of players, and not individuals. This was evidenced by 
nine players who showed extreme estimated effects for 
one season, but many exhibited the opposite sign for the 
previous four years. This inconsistency led to the sug- 
gestion that season performance might be an imperfect 
form of measurement of situational abilities, and that 
play by play data might be more helpful. 

There is yet to exist any published work which sta- 
tistically analyzes baseball at the pitch by pitch level. 
It was mentioned earlier that the depth,availability and 
accessibility of baseball data recently suggests that this 
topic should be re-visited. 
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What is most enticing about this article is the level 
at which it attempts to make inference. It is likely that 
if data was obtained at the pitch by pitch level, the situ- 
ational effects could be measured at the individual level. 
The importance of evaluating situational effects in sports 
cannot be understated, since performance can be sig- 
nificantly impacted depending upon the situation and 



in attempt to infer a player abilities and tendencies so 
that this player can be played optimally when defend- 
ing. Namely, they attempted to model the player's shot 
frequency and location. The player used for analysis 
was Sam Cassell of the Minnesota Timberwolves dur- 
ing his 2003-2004 season in the NBA. The data was 
about 90 % complete, since six games were missing and 



player. A question that arises from these results is whethera few shots were not recorded on the shot chart. 



it is possible to simply take account of these situational 
effects, and then attempt to evaluate a player's perfor- 
mance, rather than trying to measure the magnitude of 
this effect. 

The OERA/Markov Chain approach both rely on 
exploiting the Markov Property, whereas Albert (1994) 
's evaluation of breakdown statistics relies on Bayesian 
Inference. Is it possible that a combination of these 
two techniques can result in a better evaluation of a de- 
sired phenomena? The next two articles rely heavily on 
Markov Chain Monte Carlo (MCMC) methods to gen- 
erate the sample points that are used in their analysis. 
Particularly interesting is how each article analyzes a 
different sport, but both use similar techniques to ob- 
tain their results. 

2.5 Spatial Analysis of Shot Chart Data in 
Basketball 



The analysis used the shot location, angle from the 
center of the basket, time between last shot attempt (game 
clock time), shot outcome (make or miss) to compute 
ten covariates that affected the player's shot selection. 
The purpose of these covariates were to account for 
different situations when the player was on the floor. 
That is, certain covariates accounted for the change in a 
player's performance. It should become apparent to the 
reader that the function of these covariates is equivalent 
to the situational effects in Albert's (1994) article. 

Lastly, shot selection was defined as the location, 
success and frequency of shots taken, each of which 
were calculated separately. The shot location was con- 
verted from euclidean coordinates to polar coordinates 
so the relationship between the angle and distance at 
which the player took a shot could be explored. 

2.5.1 Shot Frequency 



Statistical analysis in basketball is difficult, largely 
due to the limited amount of information that is given 
in the data. If we compare the analysis of basketball to 
baseball, what makes baseball "easier" to analyze sta- 
tistically than basketball is the low dependence on spa- 
tial information of player movement. In basketball, the 
way a player moves up and down the court impacts his 
shot selection and his decision making, because they are 
allowed to take shots from anywhere. Contrast this to 
baseball, where, upon successfully hitting the ball for a 
base-hit, the player must run on a fixed (base) path to 
reach base. 

Reich et al. (2006) used basketball shot chart data 



To model the player's shot frequency, Reich et al 
(2006) calculated the median time between each shot 
attempt when the player was in the game and explored 
how this value was affected by the ten selected covari- 
ates denoted Xf., k E {1, . . . , 10}. One limiting factor 
that should be mentioned is that some of the covariates 
could only be calculated at the time of a shot attempt. 
The authors attempted to use an error- in-covariates model 
to correct for this, but the results were not any better. 

By using multiple linear regression, 

Y = X8 + b 

where Y = log(Tj) was logarithm between the time 
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of shot attempts % — 1 and i, and X was a matrix con- 
taining columns for each effect used in calculating the 
covariates , interaction between select covariates, and 
the homogenous error variance. By using the inverse 
gamma distribution IG(a\, a 2 ), with densities propor- 
tional to (erg— ai + l)exp(—a 2 /crf), IG(0.01,0.01) was 
defined as the prior for error variance, and diffuse nor- 
mal priors of mean and variance 1,000 were selected. 
Their results were analyzed using the posterior means 
with 95% confidence intervals for the exponentials of 
the regression parameters (exp bk) (the fitted multiples 
of the median times between shots relative to the zero 
condition). 

2.5.2 Shot Location 

As mentioned earlier, polar coordinates were used 
for shot location in order to develop a relationship be- 
tween the distance and angle of a shot location. To an- 
alyze shot locations, the court was divided into 11 re- 
gions to capture the 1 1 distinct angles from which the 
player attempted a shot. This resulted in discretizing 
the court into an 11 x 11 grid to make the computa- 
tion feasible. An additional cell was added underneath 
the basket since shot attempts from this location are of- 
ten after rebounding, or when the player is "driving to 
the net" (running directly to the basket in hopes to get 
a close shot). Because the location of shot was taken 
as a response to as many as four covariates, and 48 of 
the 122 cells contained as few as four shot attempts, the 
neighbouring cells shared information to stabilize the 
covariate effects. The goal was to find the effect of a 
certain subset of covariates on each cell where shot at- 
tempts were taken. 

Multinomial regression was used, where yi G {1, . . . 
was the region of the ith shot 2 , and it was assumed that 
ViWiVi) followed a multinomial distribution (#1(77^), . . . 
p (ffc)) where Ofa) = (fcfa), • • • , P (»7i)) and 
was the probability that the ith shot was taken from re- 



gion j. The probability of each cell depends on shot i's 
p- vector of linear predictors rj i = log A + x.jb. log A 
is a p-vector of offset terms and log Aj is equal to the 
area of region j in euclidean distance. This was essen- 
tial since each of the 1 1 regions had differing areas. Xj 
was a g-vector of covariates and because each of the p 
regions had their own set regression coefficients to mea- 
sure the q covariates: b.j = (bij, . . . , b q j) T for the jth 
region. Because h.j and x/s are related through ^-(r/J, 
a multinomial logit model was used to calculate 9j{r]^): 



exp(logAj +xfb.j) 
£f =1 exp(logA/ + xfb. 



If bfc. is the spatially varying coefficient of covari- 
ant k, is assigned a CAR (conditionally autoregressive) 
prior CAR(r fc ), then bj k \b k j d ^j is normally distributed 
with mean b k j and precision (which is inverse variance) 
Tknij, where bj k is the mean of h k . at region j's rrij 
neighbors, > is the degree of smoothing of each 
bkj towards its respective neighbors. 

In order to share information among cells that was 
realistic, the idea of distance and angle neighbors were 
established. An angle neighbour were adjacent cells 
that were the same distance from the basket , and a 
distance neighbour were adjacent cells that were the 
same angle from the basket. The 2NRCAR (two neigh- 
bour conditionally autoregressive) model was used to 
allow different amount of smoothing between the angle 
and distance by adding parameter (3 k G (0, 1) which 
quantifies the amount of "influence" each type of neigh- 
bour has on the respective cells. 

Using 2NRCAR(rfc, /3fc) as a prior, the distribution 
of bkj\bki,i^j is normally distributed with mean: 



,p} E(b k j\bki,ijtj) = b ajk 



m aj (3 k 



+ 



m aj (3 k + m dj (l - (3 k ) 



m dj (3 k 



(17) 



m aj /3 k + m dj (l - f3 k ) 



p, in this case, was 122 for the number of cells in the grid 



with precision T k (m a j/3 k + m<#(l — (5 k )), and similar to 
the CAR model, b ak j and b dk j are the mean of b^. for 
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region j's rridj and m a j distance and angle neighbors, 
respectively. With the above formulation,^!^ to 
have a weighted average of b a kj and bdkj, and (5 k deter- 
mines the weight of each neighbor type with r k control- 
ling the smoothness of h k .. It should be intuitive that 
0k > 0.5 smoothes angle neighbors more than distance 
neighbors, and (3k < 0.5 smooths distances neighbors 
more than angle neighbors. 

Sharing information by use of smoothing ensures 
that the computation remains numerically stable. The 
elegance of the above formulation lies in the fact that 
Reich et al. (2006) distinguished neighbouring cells in 
two different manners. By doing so, smoothing could 
be performed with on either distance or angle neigh- 
bours if the respective neighbours were suffering from 
data sparsity. Such a technique is beneficial in alleviat- 
ing the issue of data sparsity given a limited amount of 
data. 

2.5.3 Shot Success 

Assuming that the shot location is fixed the prob- 
ability of a successful shot attempt from each each of 
the p cells is analyzed. To represent realistic in-game 
behavior, covariates are allowed to vary over different 
cells is allowed because the opposition's defensive pri- 
orities may change over time, or the player's team may 
change their offensive strategy. This can be easily seen 
when the player's team is up by a significant amount of 
points, since the intent is to then protect the lead late 
into a game, which implies less offensive production. 
The opposition's defensive strategy can change when 
they are losing the game due to their defense being ex- 
posed by a certain player. As a consequence, the de- 
fense would then put priority guarding this player, caus- 
ing this player's covariates to change. 

To model shooting percentage, separate logistic re- 
gressions were each of the p = 122 locations. The re- 
sulting regression parameters for each of these p cells 
were also smoothed spatially. The probability of mak- 



ing ith shot was drawn from a Bernoulli^) distribu- 
tion, where G {1, 0} if the shot was made or missed, 
andlog(7Ti) = log (^/(l - 71-;)) = Xib. y . where y { is the 
region of the ith shot, and b . y . is the g-vector of regres- 
sion parameters associated with the region yj. 

2.5.4 Conclusion 

In the article, Reich et. al (2006) propose a very in- 
teresting methodology using simple spatial data to eval- 
uate a player's shot selection by using covariates that 
are inferred through the game's data. Both Reich et al. 
(2006) and Albert (1994) attempt to account for situa- 
tional effects on the player, but Reich et al (2006) also 
sought to evaluate their effects on a player's shot selec- 
tion using sampled data (with the appropriate parame- 
ters). 

A small drawback is that results given in the article 
are almost entirely observational, thereby making eval- 
uation of this method difficult. However, this article 
attempts to address the environmental affects on an ath- 
lete, using smoothing techniques which had not been 
attempted prior. By discretizing the state space into a 
fine grid, a cell that may have an insufficient amount of 
data points can be stabilized by use of smoothing on all 
cells that share some spatial property, such as distance 
or angle from the basketball net. It is possible that use 
of smoothing may improve evaluation of player perfor- 
mance. 

2.6 Bayesball: Evaluating Fielding in Ma- 
jor League Baseball 

Earlier, we mentioned that statistically analyzing base- 
ball is "easier" due to the low dependence on spatial 
information. However there is one case where spatial 
information is important: Defense. How can we accu- 
rately analyze the defensive aspect of baseball? What 
separates a good defensive player from a poor one? 
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Recently, Jensen et al (2009) developed a defensive 
metric that sought to evaluate a player's fielding ability 
with higher accuracy than current metrics such as UZR 
(Ultimate Zone Rating). UZR discretizes the state space 
(baseball field) into 54 zones and evaluates the number 
of successful plays made by a defensive player in each 
zone. One shortcoming with UZR is the granularity of 
the state space. It is difficult to accurately measure a 
player's fielding ability if the area of each zone is large, 
because the number of plays a fielder makes in a large 
zone doesn't necessarily lend insight to his mobility and 
skills. This can be illustrated by comparing two play- 
ers who play the same defensive position (and therefore 
start at roughly the same position prior to the catch). If 
one player catches a ball in play (BIP) little movement 
(i.e. the BIP is a pop fly), and the other player catches 
a ball that could be 20-30 feet in front (or behind) him, 
UZR will evaluate each player's play for that zone as 
the same. 

To alleviate this shortcoming in existing defensive 
metrics, Jensen et al. (2009) sought to model the suc- 



(i) Flyball / liner model: The (x, y) location of the 
BIP is set to the location where the ball landed, 
or where it was caught (if it was caught). The 
probability of making a catch was modelled as 
a function of the distance travelled to catch this 
ball, the direction he was travelling, and the ve- 
locity of this BIP. The distance travelled incorpo- 
rates two dimensions because the playing field is 
a two dimensional plane. 

(ii) Grounder model: The (x, y) location of the BIP 
is the location where the ball was fielded by ei- 
ther the infielder or outfielder, depending on if the 
ball made it through the infield. The probability 
of a fielder making a catch is a function of dis- 
tance, direction, and the velocity. However, this 
model's direction is measured by an angle where 
the BIP was caught and the infielder's starting lo- 
cation. The distance is one dimensional since the 
infielder now travels the arc length between the 
location of the catch and the starting position. 



cess of a fielder on a given BIP as a function of that c . c w ■ ■ *u 

& The reason for this formulation is intuitive: the po- 

BIP's location. To do this, a hierarchical bayesian model ... , , . , , c T , a , „ 

3 sitions being evaluated have the ability to field flyballs 

was fitted to evaluate the individual success of each fielder , . . c , , , , c , 

and liners, but infielders almost always held grounders. 



while sharing information between players at the same 
position [[61. The idea of sharing information between 
players at the same position was motivated by Reich et 
al (2006) 's work, where the relationship is now based 
on position of the player. This is a valid relationship 
because playing the same position suggests a spatial re- 
lationship both in distance and angle from the home 
plate. The values that were obtained by this method 
were named SAFE, or Spatial Aggregate Fielding Eval- 
uation. 

2.6.1 Models used 



The models were fitted according to the type of BIP 
(Flyball, Liner, Grounder) and player position. There 
are only two models for the BIPs: 



As a result the outfielders are exempt from the grounder 
model. 3 The infield positions: IB (First base), 2B (sec- 
ond base), 3B (third base) and SS (shortstop) had the 
grounder model fit to their respective position. The re- 
maining three positions: RF (Right Field), CF (Center 
Field), LF (Left Field), along with the four aforemen- 
tioned infield positions had the liner/flyball model fit to 
their positions separately. This resulted in 7x2+4 = 18 
models. Because the authors fitted each of the four 
year's data separately, this resulted in a total of 72 sep- 
arate models. 

One large similarity between the models in this arti- 
cle and the previous is their use of ball location. Instead 
of the location where a shot was taken, it is now where 
the ball landed (or caught). If a player is responsible for 



Pitchers and catchers were ignored due to the lack of data. 
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catching a ball and fails to do so, this is treated similarly 
to a missed shot in the previous article. 

Since each of the models are similar, so too are their 
mathematical formulations: 



as the probit regression model with covariate in- 
teraction, which permits different probabilities for 
the same distance travelled leftwards and right- 
wards. 



(i) Flyball / liner model: For a given fielder i, de- 
note the number of BIPs hit when that fielder was 
playing defense as rij. Since each play's outcome 
is either a success or failure, then: 



(ii) Grounder model: There are slight changes for 
this model with respect to the flyball/liner model. 
Namely, the distance is one dimensional and the 
direction is now an angle in degrees. The ob- 
served outcomes are still Bernoulli realizations as 

above, however the parameters for the cumulative 
if the 7th flyball/liner is not caught by player j. . .■ r ^ . 

J J j r j distribution function change: 

the observed outcome of a success or failure is 

Vij 



s. 



1 if the jth flyball/liner is caught by player i 



modelled as an outcome from a Bernoulli random 
variable, which has a parameter obtained from an 
underlying event- specific probability: 

~ Bernoulli (pjj). 

Each of the Bernoulli probabilities are mod- 
elled as a function of distance travelled (Dij) to 
the BIP, an indicator function that represents the 
direction in which the player was moving (F^ = 
1 if they're moving forward, Fij = if they're 
moving backwards) and which is the velocity 
of the BIP: 



$(fto + PnOij + /3 i2 o lJ ^ l] 



id a Li i 



A) 



(19) 



Pij = $(ft + fti A 



Pi2DijFij+ 



ijVijFij, 



(18) 



where % represents the angle between the start- 
ing position and the location of the BIP, is the 
velocity of the BIP, and is an indicator func- 
tion for the direction the fielder has to move to- 
wards the BIP (Lij = 1 when moving left, = 
moving right). $(•) is still a Gaussian cumulative 
distribution function. 



2.6.2 Sharing Information 

With these models being quantified, we now discuss 
the previously-mentioned idea of sharing information 
between players. When the sample size is small, players 
of the same position have the issue of large variability 
between their parameter estimates ft. If a hierarchical 
model is used with the assumption that each set of pa- 



where $(•) is the Gaussian cumulative distribu- 
tion function, and X ij; is a vector of covariates in 
the above equation. (3 i0 is the parameter that con- 
trols the probability of a fielder catching a liner/flybafimeter estimates ft are drawn from a common prior 
hit directly at them (Dij = 0), fti, ft 2 are the pa- 
rameters that control the forward (fti) and back- 
ward (ft 2 ) direction of the fielder, and ft 3 , ft 4 are 
parameters that adjust the probability of a suc- 
cessful catch as a function of velocity. Fij,Dij 
are both covariates that are functions of the (x, y) 
coordinates of the BIP. This model is recognized 



distribution, this issue is eliminated. This is done by 
drawing each set of player- specific parameters ft from a 
common distribution shared among players of the same 
position: 

ft ~ NormalO, S) 
where /i is the 5 x 1 vector containing the means and S 
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is a 5 x 5 prior co variance matrix shared by all players. 
The components of the 2th player's parameter estimates 
/3j are assumed to be independent, even though there is 
a posterior dependence between the components of this 
parameter estimate induced by the data. 

Lastly, a prior distribution must be specified for each 
position's "shared player" parameters a k : k — 
0, . . . , 4). 4 . After extensive investigation, the nonin- 
formative distribution was chosen to as: 

p(fj, k , a k ) oc 1, fc = 0, ...,4 

If we wish find the unknown parameters (3 for N 
players at a given position, then for each BIP type we 
will have aiVx5 matrix containing the parameters for 
each player at the given position, a 5 x 1 vector, /x, 
which is the vector of parameter means, and cr 2 , a 5 x 1 
vector representing the variance among each player's 
parameters. Thus for each position and BIP-type, the 
posterior distribution for parameters (3, /x, cr are sepa- 
rately estimated, 

p((3, fx, * 2 \S, X) oc p(S\(3, X)- cr 2 ). ( M , a 2 ), 

where S contains all of the outcomes Sij and X con- 
tains all location and velocity covariates Xy. The pos- 
terior distribution of the unknown parameters for each 
position and each type of BIP is estimated using Gibbs 
sampling. 

Instead of sharing information across cells, as was 
the case in the previous article, the information is being 
shared across players. This is because the previous ar- 
ticle focused on how the covariates affected shot selec- 
tion, where shot selection must fall in a certain cell. Just 
as sharing information with certain neighbouring cells 
was important in stabilizing the computation in Reich 
et al's (2006) article, sharing information among play- 
ers is important since each player has a limited amount 
of fielding data over a season. By pooling all of this in- 
remembering that only players of the same position share in- 
formation! 



formation together for players with the same position, 
sampling from the appropriate representative distribu- 
tion, this becomes less of an issue. Both articles rely on 
using MCMC techniques in conjunction with sharing 
information to achieve their result. 

2.6.3 Comparing results 

After extensive discussion and evaluation of SAFE, 
the results 5 for each position are compared to UZR. The 
difference between SAFE and UZR is that SAFE seeks 
to evaluates the number of runs that were saved (or cost) 
by a player's defensive play, while UZR records ac- 
tual observations. To compare SAFE and UZR, correla- 
tion for each player's respective values was attempted, 
but yielded an inconclusive result. This shouldn't be 
surprising, because there is no standardized metric that 
measures a player's defensive play. 

To allow further comparison, the assumption that 
player ability is constant over time was added. This 
assumption would be supported if the player's value 
maintained a high consistency across seasons, since the 
noise in the data would be overcome by the "true sig- 
nal". When comparing SAFE'S correlation across sea- 
sons to UZR's, it did well for the outfield positions but 
struggled for infield positions such as shortstop and sec- 
ond base. SAFE and UZR's correlation over these sea- 
sons was then averaged over all seven positions, where 
SAFE had a slightly higher average correlation than that 
of UZR. This should also not be surprising, since SAFE 
sought to evaluate defensive play better than UZR by 
using an elegant, intuitive and sophisticated representa- 
tion for evaluation. 

Another interesting question that can be posed is: 
Can SAFE's infielder model be adjusted such that it 
shows a higher level of correlation across seasons? SAFE's 
results show correlations of (0.525,0.594,0.444,0.503,0.287) 
for positions (CF,LF,RF,3B,1B) across seasons, but a 

5 Any reader that is curious about the results for which players 
were evaluated are encouraged to check the article. 
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meagre (0.051,-0.03) for (2B, and SS) respectively. The 
intrigue behind this lies within the fact that the SS and 
2B positions are closer than any other two defensive po- 
sitions on the field. The distance and angle for which a 
2B might need to move to catch a BIP to his right might 
depend on the calibre of the team's shortstop. Is there a 
minor adjustment to the infield model that can account 
for the 2B/SS proximity, and eliminate the possibility 
that both the SS and 2B's ability to make plays are af- 
fected by the BIPs they can catch? In any case, the 
results and methods used in this article demonstrate the 
power of statistical analysis in sport when a statistician 
is equipped with computational power. 

The articles covered so far accurately embody the 
different ways statistical analysis has been performed 
in sport. In the last paragraph the term computational 
power was mentioned. If SAFE was proposed 15 years 
ago, it would have been computationally infeasible. As 
computational power has gotten cheaper, statistically 
analysis has benefitted. Another beneficial use of this 
computational power is to perform statistical analysis 
using techniques in Artificial Intelligence. The area 
of Artificial Intelligence contains numerous algorithms, 
many of which are built on statistical theory, that have 
the potential to improve statistical analysis in sport. We 
conclude this review by exploring a case study which 
applies an Artificial Intelligence technique to football. 

2.7 Modelling using Neuro-Dynamic Pro- 
gramming 

As we have seen, modelling sports using various statis- 
tical models is challenging. A case study performed by 
Patek and Bertsekas (1996) used Neuro-Dynamic pro- 
gramming (NDP) to simulate the offensive play call- 
ing in football flTJ. The reason to use football as the 
"testbed" was due to it's state space: it lies on the bound- 
ary between medium-scale (tractable) and large-scale 
(computationally infeasible) problems. The reason this 



This discretization is equivalent to the two articles 
previous because it seeks to represent a real- valued state 
space accurately, but also efficiently. Similar to the pre- 
vious articles, sample data that was representative of 
typical play was generated to ensure that biased approx- 
imations wouldn't result. This was performed by using 
a Gibbs sampler. 

A few rules were enforced to create a simplified ver- 
sion of football. Consider a single offensive drive in 
the middle of a game that's infinitely long. The ob- 
jective for "our" team is to maximize the score during 
"our" team's offensive drive offset by the rival team's 
expected score from where they receive the ball. The 
state of "our" team is represented by integers x, y, d 
where x is the number of yards until the goal is reached, 
y is the number of yards remaining to reach a first down, 
and d is the down number. The offensive drive for either 
team will terminate when: 

(i) A team fails to get a new set of downs. That is, 
they fail to achieve a first down after four plays. 
For those unfamiliar with American Football: A 
team always begins its offensive drive with a first 
down and must achieve y yards in four plays to 
receive a new first down, and a new y value. If 
a team doesn't achieve y yards in one play, the 
down number d is incremented, and y is decre- 
mented by the number of yards obtained from this 
play. 

(ii) Touchdown is scored. (This means the offense 
must achieve x < 0.). 

(iii) A turnover either through a fumble on a run at- 
tempt, or interception during a pass attempt, or 
whenever a team elects to punt or attempt a field 
goal. 

The outcome of a given offensive drive is random, de- 
pending on the quarterback's strategy and the associated 
transition probabilities for the diverse play options and 
points in the state space. The decision-maker has four 



problem isn't computationally infeasible is due to cfocretzp^Qptions from which he can choose: 0- Run, 1-Pass 
of the yardage, which in the real-world is real - valued, (attempt), 2-Punt, 3-Field Goal (attempt). Further in- 
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formation on each type of play option can be obtained 
by consulting the literature. 

With these rules and ideas established, the prob- 
lem of maximizing total expected reward can be rep- 
resented as the stochastic shortest path problem be- 
cause "maximizing net reward" is equivalent to "min- 
imizing net costs"; therefore NDP methods mentioned 
in Appendix O are applicable. As mentioned above, 
there are a finite number of states for which quarter- 
back should have a control action in mind. Such states 
where the quarterback will have a control action in mind 
will be denoted by % G S, where S is finite. The triple 
(xj, yi,di) denotes the number of yards to the goal, yards 
to the first down, and the down number corresponding 
to state i G S. By abuse of notation, (xi, y^ di) is writ- 
ten as (x, y, d) because there is only one triple for each 
state i G S and vice-versa. 

The quarterback's policy is represented as a func- 
tion {j, : S — > U, where U = {0, 1, 2, 3}, the control 
options available to the quarterback. Whenever posses- 
sion is lost by "our" team, we transition to an absorbing 
state T with zero-reward, similar to transitioning to the 
three-out state in the Markov Chain model for baseball. 
The quantity of the reward g is the score received at the 
end of "our" team's drive minus the expected score ob- 
tained by the opposing team starting at the given field 
position. 

With this representation, the optimal policy found it 
best to run between x = 1 and x = 65, attempt to pass 
from x = 66 to x = 94, and run again from x = 95 to 
the goal (x = 100). The reward function that used this 
optimal policy showed yielded an expected reward of 
-0.9449 points when starting from x = 20. This meant 
that if the "our" team received the ball at x = 20 every 
time, they would lose the game. This result is a function 
of arbitrarily-set parameters in the model. If they are 
adjusted sufficiently, positive reward could be obtained. 

What was also interesting was when y was varied 
for each x, because y can be as large as x (although the 
likelihood of y > 20 is very small). On second downs, it 
was found that the optimal policy dictates pass attempts 



be made for a wide array of x, y values, with the remain- 
ing x, y values as run attempts. For third down, the op- 
timal policy recommended a pass attempts, but if x and 
y were large enough, the policy suggested punting. For 
fourth down, the policy had a diverse recommendation; 
if "our" team was close enough to the goal or a first 
down, a running or passing play was recommended. If 
either a first down or the goal is far away, either a field 
goal attempt or punt was recommended. By producing 
a policy that is in agreement with commonly-employed 
football strategy, Neuro Dynamic Programming shows 
promise for devising the "optimal" plan when playing a 
general opponent. 

Particularly advantageous to using NDP techniques 
was the ability to hypothesize a class of policies that 
represented legitimate football strategy. These policies 
can then be simulated. Using a heuristic policy that 
is reflective of most good "play-callers" in football, it 
was found that this policy had an expected reward was - 
1 .26 which is .32 game points worse than optimal when 
starting from state i* -H- (xi* = 80, yi* = 10, d^ = 
1). This is interesting because even though the result 
is worse than the optimal policy, the fact that an ar- 
bitrary heuristic policy could be evaluated suggests the 
optimal policy could be surpassed with a combination 
of (unbiased) parameter adjustments and refinement of 
the policy itself. 

We see from these results that, even though this is a 
simplified version of football, that a Markov represen- 
tation of sports combined with techniques in Artificial 
Intelligence (such as NDP) hold promise for analysis 
in sports. This is shown by finding an optimal policy 
that is realistic and reflective of play calling in football. 
In baseball, it is possible that a similar model could be 
constructed. 

3 Conclusion 

Throughout this survey, we have been exposed to dif- 
ferent methodologies and techniques that all had simi- 
lar goals: an accurate representation of either players, 
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teams, the performance of players and teams, and per- 
formance of players in select situations. 

An alternative methodology for evaluating players 
is to construct predictive models that are trained on real- 
world data. NDP produces a promising result, espe- 
cially considering how it was trained on sampled data 
that was representative of the real-world. It could be 
the case that training on real world data produces a bet- 
ter result than the one obtained above. Further experi- 
mentation is required in order to determine the validity 
of applying NDP to sports. 

Aside from the previous article using NDP, the lit- 
erature for statistical analysis in sports using Statistical 
Learning techniques is sparse. The computer is not just 
a computational tool, it possesses the ability to greatly 



If we assume we have more than one absorbing state, 
then denote the set of these states by 8. If we denote the 
set of non-absorbing states as S = {si, . . . , s n }/{£}, 
then we can construct matrices Q and S such that matrix 
Q is of dimension |«S| x |«S| and matrix S is of dimen- 
sion \S\ x \8\. The matrix S is the transition matrix of 
the states in S to states in 8. Matrix Q is the transition 
matrix from states in S to states also in S. 

Using these matrices Q, S, we can find the proba- 
bility of reaching a state in 8 from states in S by the 
following expression: 



(i-Qr's 



(20) 



which also yields a |«S| x \8\ matrix. The entries of 



improve statistical analysis in sport when Statistical Learn- tnis matrix § ive the Probability of entering an absorbing 
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ing algorithms, which possess elegant mathematical prop- 
erties, are used. These elegant mathematical properties 
are exploited to not only reduce the amount of compu- 
tation, but provide intricate analyses that further sub- 
stantiate a statistician's results. Application of these 
techniques to sports will allow statisticians to investi- 
gate complex situations such as the effect of runners 
on-base on a player's pitching strategy. Investigation 
of such complex situations in areas of sport may lead 
to development of precise player metrics that quantify 
players' decision making abilities. 



state in 8 from a state in S. The matrix (J — Q)~ 
is referred to as the fundamental matrix. The (i,j)th 
entry of this matrix is the expected number of periods 
that the Markov Chain spends in non-absorbing state j 
given that the chain began in state i. 

As we will see in section[2l Absorbing Markov Chains 
are important for certain representations in the game of 
baseball. 

Appendix B: Gibbs Sampling 



Some papers in this review will use the idea of Gibbs 



Appendix A: Absorbing State Markov Chains 



sampling, which is an algorithm that generates a se- 



Assume we are given a simple n-state Markov chain 
(Xt)tZi and states {s±, . . . , s n } with an associated nxn 
transition matrix P. The ith row and jth column of tran- 
sition matrix P correspond to the probability P(i,j) = 
V(X t = j\X t - 1 =i). 

An Absorbing Markov Chain is Markov Chain where 
there exists at least one absorbing state. An absorbing 
state is defined as a state which, upon transitioning to, 
cannot be exited. Mathematically, this is state s k , where 
1 < k < n, such that P(k,k) = F(X t = s k \X t=1 = 
Sk) = 1. 



quence data samples from a multivariate probability dis- 
tribution. The point of this generating is to approximate 
the joint posterior distribution in order to marginalize 
with respect to some subset of the variables. If we have 
random variables X, Y, then to compute f(x): 



/(*) 



f{x,y)dy 



however, in practice this can be a difficult integration. 
Often in statistical models, either f(y\x)orf(x\y) are 
available, and the Gibbs sampler generates samples of 
f(x) from these distributions. The generated sequence 
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of random variables is called a "Gibbs sequence", and • a E (0, 1] is the discounting rate for rewards in 
the act of generating this sequence is called "Gibbs Sam- the future 

pling": • The expectation over all trajectories of states {s k } 

that are possible under n. 

Y' X 1 Y' X' Y' X' Y' X' 

When looking at the stochastic shortest path prob- 
where F ' = y' is specified and the remaining samples lem and the discounted reward problem, the optimal 
are iteratively generated from: reward f unct ion from each state is: 



X' k f(x\Yj = y' 3 ) 

y; + i f(v\x$ = 4) 



(2i) r 



With the condition that k is large enough, the distribu- 
tion of X' k is said to converge to f(x), the "true" marginal 
probability distribution function for random variable X, 
when k — > oo. Therefore, if we have a large enough se- 
quence of samples, then X' k = x' k is said to be a "true" 
sample from the distribution of random variable X 0). 



Appendix C: Neuro Dynamic Programming 

When dealing with the general class of Markov Deci- 
sion processes, we are usually interested in finding a 
policy (probability distribution) such that our long-term 
total reward is maximized. By applying a control at 
a given state, the probability distribution that governs 
the immediate reward and transitions to the next state 
is determined. Given a starting state i, the long term 
discounted reward is: 



J *( s o) = , max E< S2a k g(s k , u k {s k ),s k+1 )\s ,iT 



k=0 



(22) 



where: 



The states {s k } is a trajectory representing the se- 
quence of states in a finite state space S. 
tt = jj}, . . .} E II is the policy, which is a 
sequence of functions jj k E M mapping the state 
space S to a finite set of "allowable" controls U. 
The reward from transitioning from i to j under 
the control u is g(i, u, j) 



max 

ueu 



^2Pij(u)(g(i,u, j)+aJ*(j)) 



(23) 



This expression is often referred to Bellman's equation. 
It is the value of the control u which achieves the max- 
imum in Bellman's equation for each state i E S and 
determines the stationary optimal policy fi*. 

Because solving explicitly for this system of equa- 
tions is difficult, value iteration is used. The goal of 
value iteration is to start with a guess for J*, called J°, 
defined for all states i E S. With successive iterations, 
the function for the kth iteration is: 



J k (i)=max y2 Plj (u)(g(i,u,j)+aJ k , Vz E S 

(24) 

where, in the limit J* = lim^oo J k {i) Vi E S. Intu- 
itively speaking, this means the approximation of each 
state's reward function should approach the value of the 
optimal reward function J* as the number of iterations 
increase. 

An alternative way to compute the optimal reward 
function, J* uses the policy iteration algorithm. This 
algorithm starts with a policy /j° and evaluates for 
all states i E S. The A;th iteration for this policy is 
computed by: 



i = arg max 

ueu 



Vi E S 



(25) 

Which will converge to J* as long as the evaluations of 
are exact and S and U are finite. It is assumed that 
when these algorithms are used in this article that the 
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discount factor a — 1. 

Appendix D: Baseball Terminology 

We provide a quick description of many terms that are 
frequently used in baseball. 

• An at-bat is a plate appearance for a batter. This 
at-bat has a count starting at 0-0. The first num- 
ber represents the number of "balls" (explained 
below), of which there can be a maximum of four. 
If four "balls" are achieved, the batter is advanced 
to first base (called a "base on ball"). The second 
number represents the number of strikes, if three 
strikes are obtained then the batter is out and the 
plate appearance ends. 

• A "ball" is when the pitcher fails to throw a pitch 
in the batter's strike zone. 

• Base on balls will be referred to as "walks" for 
this paper. 

• A "flyout" is when the batter successfully hits the 
pitch but it was caught by an outfielder on the op- 
posing team. 

• A "groundball" is when the batter hits the pitch 
but it rolls on the ground. 

• A "sacrifice" is when the batter hits a pitch that 
results in either a flyout or groundout to advance 
a runner on base (which includes scoring a runner 
on third base). 

• An inning in baseball consists of each team taking 
their turn to bat. Each inning for a team consists 
of three outs, and therefore at least three batters 
will get a plate appearance prior to the other team 
taking their turn to bat. 
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